Abstract

We propose STARS, a randomized derivative-free algorithm for unconstrained
optimization when the function evaluations are contaminated with random noise. STARS
takes dynamic, noise-adjusted smoothing stepsizes that minimize the least-squares
error between the true directional derivative of a noisy function and its finite-difference
approximation. We provide a convergence rate analysis of STARS for solving convex
problems with additive or multiplicative noise. Experimental results show that (1)
STARS exhibits noise-invariant behavior with respect to different levels of stochastic
noise; (2) the practical performance of STARS in terms of solution accuracy and
convergence rate is significantly better than that indicated by the theoretical result; and
(3) STARS outperforms a selection of randomized zero-order methods on both additive-
and multiplicative-noisy functions.


1 Introduction 


We propose STARS, a randomized derivative-free algorithm for unconstrained optimization
when the function evaluations are contaminated with random noise. Formally, we address
the stochastic optimization problem

$$\min_{x \in \mathbb{R}^n} f(x) = \mathbb{E}_{\xi}\big[\tilde f(x;\xi)\big], \tag{1.1}$$

where the objective $f(x)$ is assumed to be differentiable but is available only through noisy
realizations $\tilde f(x;\xi)$. In particular, although our analysis will at times assume that the
gradient of the objective function $f(x)$ exists and is Lipschitz continuous, we assume that direct
evaluation of these derivatives is impossible. Of special interest to this work are situations
when derivatives are unavailable or unreliable because of stochastic noise in the objective
function evaluations. This type of noise introduces the dependence on the random variable
$\xi$ in (1.1) and may arise if random fluctuations or measurement errors occur in a
simulation producing the objective $f$. In addition to stochastic and Monte Carlo simulations, this
stochastic noise can also be used to model the variations in iterative or adaptive simulations
resulting from finite-precision calculations and specification of internal tolerances [14].
Various methods have been designed for optimizing problems with noisy function
evaluations. One such class of methods, dating back half a century, are randomized search
methods [11]. Unlike classical, deterministic direct search methods [1, 2, 4, 10, 20, 21],
randomized search methods attempt to accelerate the optimization by using random
vectors as search directions. These randomized schemes share a simple basic framework, allow
fast initialization, and have shown promise for solving large-scale derivative-free problems
[7, 19]. Furthermore, optimization folklore and intuition suggest that these randomized
steps should make the methods less sensitive to modeling errors and "noise" in the general
sense; we will systematically revisit such intuition in our computational experiments.

Recent works have addressed the special case of zero-order minimization of convex
functions with additive noise. For instance, Agarwal et al. [3] utilize a bandit feedback
model, but the regret bound depends on a term of order $n^{16}$. Recht et al. [17] consider a
coordinate descent approach combined with an approximate line search that is robust to
noise, but only theoretical bounds are provided. Moreover, the situation where the noise
is nonstationary (for example, varying relative to the objective function) remains largely
unstudied.

Our approach is inspired by the recent work of Nesterov [15], which established
complexity bounds for convergence of random derivative-free methods for convex and nonconvex
functions. Such methods work by iteratively moving along directions sampled from a normal
distribution surrounding the current position. The conclusions are true for both the smooth
and nonsmooth Lipschitz-continuous cases. Different improvements of these random search
ideas appear in the latest literature. For instance, Stich et al. [19] give convergence rates
for an algorithm where the search directions are uniformly distributed random vectors in a
hypersphere and the stepsizes are determined by a line-search procedure. Incorporating the
Gaussian smoothing technique of Nesterov [15], Ghadimi and Lan [7] present a randomized
derivative-free method for stochastic optimization and show that the iteration complexity
of their algorithm improves Nesterov's result by a factor of order $n$ in the smooth, convex
case. Although complexity bounds are readily available for these randomized algorithms,
the practical usefulness of these algorithms and their potential for dealing with noisy
functions have been relatively unexplored.

In this paper, we address ways in which a randomized method can benefit from careful
choices of noise-adjusted smoothing stepsizes. We propose a new algorithm, STARS, short
for STepsize Approximation in Random Search. The choice of stepsize is largely
motivated by Moré and Wild's recent work on estimating computational noise [12] and
derivatives of noisy simulations [13]. STARS takes dynamically changing smoothing stepsizes
that minimize the least-squares error between the true directional derivative of a noisy
function and its finite-difference approximation. We provide a convergence rate analysis of
STARS for solving convex problems with both additive and multiplicative stochastic noise.
With nonrestrictive assumptions about the noise, STARS enjoys a convergence rate for noisy
convex functions identical to that of Nesterov's random search method for smooth convex
functions.

The second contribution of our work is a numerical study of STARS. Our experimental
results illustrate that (1) the performance of STARS exhibits little variability with respect
to different levels of stochastic noise; (2) the practical performance of STARS in terms of
solution accuracy and convergence rate is often significantly better than that indicated by
the worst-case, theoretical bounds; and (3) STARS outperforms a selection of randomized
zero-order methods on both additive- and multiplicative-noise problems.

The remainder of this paper is organized as follows. In Section 2 we review basic
assumptions about the noisy function setting and results on Gaussian smoothing. Section 3
presents the new STARS algorithm. In Sections 4 and 5, a convergence rate analysis is
provided for solving convex problems with additive noise and multiplicative noise, respectively.
Section 6 presents an empirical study of STARS on popular test problems by examining the
performance relative to both the theoretical bounds and other randomized derivative-free
solvers.


2 Randomized Optimization Method Preliminaries 

One of the earliest randomized algorithms for the nonlinear, deterministic optimization
problem

$$\min_{x \in \mathbb{R}^n} f(x), \tag{2.1}$$

where the objective function $f$ is assumed to be differentiable but evaluations of the gradient
$\nabla f$ are not employed by the algorithm, is attributed to Matyas [11]. Matyas introduced
a random optimization approach that, at every iteration $k$, randomly samples a point $x_+$
from a Gaussian distribution centered on the current point $x_k$. The function is evaluated
at $x_+ = x_k + u_k$, and the iterate is updated depending on whether decrease has been seen:

$$x_{k+1} = \begin{cases} x_+ & \text{if } f(x_+) < f(x_k), \\ x_k & \text{otherwise.} \end{cases}$$
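The accept-only-on-decrease rule above can be sketched in a few lines. This is a minimal illustration rather than Matyas's original formulation: the function name `matyas_search`, the sampling scale `sigma`, and the `sphere` test function are assumptions made for the example.

```python
import random

def matyas_search(f, x0, iters=200, sigma=1.0, seed=0):
    """Random search: keep a Gaussian trial point only if it decreases f."""
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    for _ in range(iters):
        # Trial point x_+ = x_k + u_k, with u_k drawn from N(0, sigma^2 I).
        x_plus = [xi + rng.gauss(0.0, sigma) for xi in x]
        f_plus = f(x_plus)
        if f_plus < fx:          # accept only on strict decrease
            x, fx = x_plus, f_plus
    return x, fx

def sphere(x):
    """Hypothetical smooth test function f(x) = ||x||^2."""
    return sum(xi * xi for xi in x)

x_best, f_best = matyas_search(sphere, [2.0, -1.5], iters=500, sigma=0.5)
```

Because a trial point is accepted only when it improves the objective, the returned value can never exceed the starting value; the scheme uses no stepsize rule, which is the gap the Polyak-type iterations address.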


Polyak [16] improved this scheme by describing stepsize rules for iterates of the form

$$x_{k+1} = x_k - h_k \frac{f(x_k + \mu_k u_k) - f(x_k)}{\mu_k}\, u_k, \tag{2.2}$$

where $h_k > 0$ is the stepsize, $\mu_k > 0$ is called the smoothing stepsize, and $u_k \in \mathbb{R}^n$ is a
random direction.

Recently, Nesterov [15] has revived interest in Polyak-like schemes by showing that
Gaussian directions allow one to benefit from properties of a Gaussian-smoothed version
of the function $f$,

$$f_\mu(x) = \mathbb{E}_u[f(x + \mu u)], \tag{2.3}$$

where $\mu > 0$ is again the smoothing stepsize and where we have made explicit that the
expectation is being taken with respect to the random vector $u$.
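To make the smoothing in (2.3) concrete, the following sketch estimates $f_\mu(x)$ by Monte Carlo sampling and compares it with a closed form. The test function $f(x) = \frac{1}{2}\|x\|^2$ is an assumption chosen for illustration; for it, expanding the square gives $f_\mu(x) = \frac{1}{2}\|x\|^2 + \frac{1}{2}\mu^2 n$. The helper names are hypothetical.

```python
import random

def smoothed_value(f, x, mu, samples=20000, seed=0):
    """Monte Carlo estimate of f_mu(x) = E_u[f(x + mu*u)], u ~ N(0, I_n)."""
    rng = random.Random(seed)
    n = len(x)
    total = 0.0
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += f([xi + mu * ui for xi, ui in zip(x, u)])
    return total / samples

def half_sq_norm(x):
    """Illustrative choice f(x) = 0.5 * ||x||^2."""
    return 0.5 * sum(xi * xi for xi in x)

x = [1.0, -2.0, 0.5]
mu = 0.1
estimate = smoothed_value(half_sq_norm, x, mu)
# Closed form for this particular f: f(x) + 0.5 * mu^2 * n.
exact = half_sq_norm(x) + 0.5 * mu**2 * len(x)
```

For this convex quadratic the smoothed function lies above $f$ by exactly $\frac{1}{2}\mu^2 n$, so the smoothing bias shrinks quadratically as $\mu$ decreases.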

Before proceeding, we review additional notation and results concerning Gaussian
smoothing.

2.1 Notation

We say that a function $f \in C^{0,0}(\mathbb{R}^n)$ if $f : \mathbb{R}^n \to \mathbb{R}$ is continuous and there exists a constant
$L_0$ such that

$$|f(x) - f(y)| \le L_0 \|x - y\| \quad \forall x, y \in \mathbb{R}^n,$$

where $\|\cdot\|$ denotes the Euclidean norm. We say that $f \in C^{1,1}(\mathbb{R}^n)$ if $f : \mathbb{R}^n \to \mathbb{R}$ is
continuously differentiable and there exists a constant $L_1$ such that

$$\|\nabla f(x) - \nabla f(y)\| \le L_1 \|x - y\| \quad \forall x, y \in \mathbb{R}^n. \tag{2.4}$$

Equation (2.4) is equivalent to

$$|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L_1}{2} \|x - y\|^2 \quad \forall x, y \in \mathbb{R}^n, \tag{2.5}$$

where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product.

Similarly, if $x^*$ is a global minimizer of $f \in C^{1,1}(\mathbb{R}^n)$, then (2.5) implies that

$$\|\nabla f(x)\|^2 \le 2 L_1 \big(f(x) - f(x^*)\big) \quad \forall x \in \mathbb{R}^n. \tag{2.6}$$

We recall that a differentiable function $f$ is convex if

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle \quad \forall x, y \in \mathbb{R}^n. \tag{2.7}$$

2.2 Gaussian Smoothing 

We now examine properties of the Gaussian approximation of $f$ in (2.3). For $\mu \ne 0$, we let
$g_\mu(x)$ be the first-order-difference approximation of the derivative of $f(x)$ in the direction
$u \in \mathbb{R}^n$,

$$g_\mu(x) = \frac{f(x + \mu u) - f(x)}{\mu}\, u,$$

where the nontrivial direction $u$ is implicitly assumed. By $\nabla f_\mu(x)$ we denote the gradient
(with respect to $x$) of the Gaussian approximation in (2.3). For standard (mean zero,
covariance $I_n$) Gaussian random vectors $u$ and a scalar $p \ge 0$, we define

$$M_p \equiv \mathbb{E}_u[\|u\|^p] = \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} \|u\|^p e^{-\frac{1}{2}\|u\|^2}\, du. \tag{2.8}$$

We summarize the relationships for Gaussian smoothing from [15] upon which we will
rely in the following lemma.

Lemma 2.1. Let $u \in \mathbb{R}^n$ be a standard Gaussian random vector. Then, the following
are true.

(a) For $M_p$ defined in (2.8), we have

$$M_p \le n^{p/2}, \quad \text{for } p \in [0, 2], \text{ and} \tag{2.9}$$
$$M_p \le (n + p)^{p/2}, \quad \text{for } p \ge 2. \tag{2.10}$$

(b) If $f$ is convex, then

$$f_\mu(x) \ge f(x) \quad \forall x \in \mathbb{R}^n. \tag{2.11}$$

(c) If $f$ is convex and $f \in C^{1,1}(\mathbb{R}^n)$, then

$$|f_\mu(x) - f(x)| \le \frac{\mu^2}{2} L_1 n \quad \forall x \in \mathbb{R}^n. \tag{2.12}$$

(d) If $f$ is differentiable at $x$, then

$$\mathbb{E}_u[g_\mu(x)] = \nabla f_\mu(x) \quad \forall x \in \mathbb{R}^n. \tag{2.13}$$

(e) If $f$ is differentiable at $x$ and $f \in C^{1,1}(\mathbb{R}^n)$, then

$$\mathbb{E}_u[\|g_\mu(x)\|^2] \le 2(n + 4)\|\nabla f(x)\|^2 + \frac{\mu^2}{2} L_1^2 (n + 6)^3 \quad \forall x \in \mathbb{R}^n. \tag{2.14}$$


Algorithm 1 (STARS: STep-size Approximation in Randomized Search)

1: Choose initial point $x_1$, iteration limit $N$, stepsizes $\{h_k\}_{k \ge 1}$. Evaluate the function at
   the initial point to obtain $\tilde f(x_1; \xi_0)$. Set $k \leftarrow 1$.

2: Generate a random Gaussian vector $u_k$, and compute the smoothing parameter $\mu_k$.

3: Evaluate the function value $\tilde f(x_k + \mu_k u_k; \xi_k)$.

4: Call the stochastic gradient-free oracle

$$s_{\mu_k}(x_k; u_k, \xi_k, \xi_{k-1}) = \frac{\tilde f(x_k + \mu_k u_k; \xi_k) - \tilde f(x_k; \xi_{k-1})}{\mu_k}\, u_k. \tag{3.1}$$

5: Set $x_{k+1} = x_k - h_k s_{\mu_k}(x_k; u_k, \xi_k, \xi_{k-1})$.

6: Evaluate $\tilde f(x_{k+1}; \xi_k)$, update $k \leftarrow k + 1$, and return to Step 2.
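The moment bounds (2.9) and (2.10) can be checked numerically. The sketch below relies on the standard chi-distribution moment formula $M_p = 2^{p/2}\,\Gamma((n+p)/2)/\Gamma(n/2)$, which is not stated in the text and is brought in here as an assumption; the function names are hypothetical.

```python
import math

def gaussian_norm_moment(n, p):
    """M_p = E||u||^p for u ~ N(0, I_n), via the chi-distribution moment formula."""
    log_m = 0.5 * p * math.log(2.0) + math.lgamma(0.5 * (n + p)) - math.lgamma(0.5 * n)
    return math.exp(log_m)

def moment_bound(n, p):
    """Upper bound on M_p from (2.9) for p <= 2 and (2.10) otherwise."""
    return n ** (p / 2) if p <= 2 else (n + p) ** (p / 2)

# Tabulate (n, p, exact moment, bound) for a few dimensions and powers.
checks = [(n, p, gaussian_norm_moment(n, p), moment_bound(n, p))
          for n in (1, 5, 50, 500) for p in (1, 2, 4, 6)]
```

The powers $p = 2, 4, 6$ are the ones used in the later derivations, where $\|u\|^2$, $\|u\|^4$, and $\|u\|^6$ are replaced by $n$, $(n+4)^2$, and $(n+6)^3$.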

3 The STARS Algorithm 

The STARS algorithm for solving (1.1) while having access to the objective $f$ only through
its noisy version $\tilde f$ is summarized in Algorithm 1.

In general, the Gaussian directions used by Algorithm 1 can be drawn from general Gaussian
distributions (e.g., with the covariance informed by knowledge about the scaling or curvature
of $f$). For simplicity of exposition, however, we focus on standard Gaussian directions as
formalized in Assumption 3.1. The general case can be recovered by a change of variables
with an appropriate scaling of the Lipschitz constant(s).

Assumption 3.1 (Assumption about direction $u$). In each iteration $k$ of Algorithm 1, $u_k$ is
a vector drawn from a multivariate normal distribution with mean 0 and covariance matrix
$I_n$; equivalently, each element of $u_k$ is independently and identically distributed (i.i.d.) from
a standard normal distribution, $\mathcal{N}(0, 1)$.
What remains to be specified is the smoothing stepsize $\mu_k$. It is computed by
incorporating the noise information so that the approximation of the directional derivative has
minimum error. We address two types of noise: additive noise (Section 4) and multiplicative
noise (Section 5). These two forms of how $\tilde f$ depends on the random variable $\xi$ correspond
to two ways that noise often enters a system. The following sections provide near-optimal
expressions for $\mu_k$ and a convergence rate analysis for both cases.

Importantly, we note that Algorithm 1 allows the random variables $\xi_k$ and $\xi_{k-1}$ used in
(3.1) to be different from one another. This generalization is in contrast to the stochastic
optimization methods examined in [15], where it is assumed the same random variables are
used in the smoothing calculation. This generalization does not affect the additive noise
case, but will complicate the multiplicative noise case.
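The loop of Algorithm 1 can be sketched as follows, with the smoothing stepsize passed in as a plain number. This is an illustration under stated assumptions, not the authors' implementation: the names `stars_step` and `run_stars`, the noisy sphere test problem, the noise level `1e-3`, and the fixed value `mu=1e-2` are all hypothetical choices made for the example.

```python
import random

def stars_step(noisy_f, x, fx_noisy, mu, h, rng):
    """One pass through Steps 2-6 of Algorithm 1 for a given smoothing stepsize mu."""
    u = [rng.gauss(0.0, 1.0) for _ in x]                       # Step 2: u_k ~ N(0, I_n)
    f_trial = noisy_f([xi + mu * ui for xi, ui in zip(x, u)])  # Step 3: noisy trial value
    slope = (f_trial - fx_noisy) / mu                          # Step 4: scalar factor in (3.1)
    x_next = [xi - h * slope * ui for xi, ui in zip(x, u)]     # Step 5: update the iterate
    return x_next, noisy_f(x_next)                             # Step 6: fresh noisy value

def run_stars(noisy_f, x0, mu, h, iters, seed=0):
    rng = random.Random(seed)
    x, fx_noisy = list(x0), noisy_f(x0)
    for _ in range(iters):
        x, fx_noisy = stars_step(noisy_f, x, fx_noisy, mu, h, rng)
    return x

# Hypothetical test problem: f(x) = ||x||^2 (gradient Lipschitz constant 2)
# observed with additive Gaussian noise of standard deviation 1e-3.
noise_rng = random.Random(1)

def sphere(x):
    return sum(xi * xi for xi in x)

def noisy_sphere(x):
    return sphere(x) + noise_rng.gauss(0.0, 1e-3)

n = 5
x0 = [1.0] * n
h = 1.0 / (4 * 2.0 * (n + 4))   # step length of the form 1 / (4 * L_1 * (n + 4))
x_final = run_stars(noisy_sphere, x0, mu=1e-2, h=h, iters=600)
```

Each noisy value is drawn independently, which mirrors the remark that $\xi_k$ and $\xi_{k-1}$ may differ. Only one new function evaluation per iteration feeds the difference in Step 4, because the value at the current iterate is carried over from Step 6 of the previous pass.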


4 Additive Noise 

We first consider an additive noise model for the stochastic objective function $\tilde f$:

$$\tilde f(x; \xi) = f(x) + \nu(x; \xi), \tag{4.1}$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth, deterministic function, $\xi \in \Xi$ is a random vector with
probability distribution $P(\xi)$, and $\nu(x; \xi)$ is the stochastic noise component.

We make the following assumptions about $f$ and $\nu$.

Assumption 4.1 (Assumption about $f$). $f \in C^{1,1}(\mathbb{R}^n)$ and $f$ is convex.

Assumption 4.2 (Assumption about additive $\nu$).

1. For all $x \in \mathbb{R}^n$, $\nu$ is i.i.d. with bounded variance $\sigma_a^2 = \mathrm{Var}(\nu(x; \xi)) > 0$.

2. For all $x \in \mathbb{R}^n$, the noise is unbiased; that is, $\mathbb{E}_\xi[\nu(x; \xi)] = 0$.

We note that $\sigma_a^2$ is independent of $x$ since $\nu(x; \xi)$ is identically distributed for all $x$. The
second assumption is nonrestrictive, since if $\mathbb{E}_\xi[\nu(x; \xi)] \ne 0$, we could just redefine $f(x)$ to
be $f(x) - \mathbb{E}_\xi[\nu(x; \xi)]$.


4.1 Noise and Finite Differences 


Moré and Wild [13] introduce a way of computing the smoothing stepsize $\mu$ that mitigates
the effects of the noise in $\tilde f$ when estimating a first-order directional derivative. The method
involves analyzing the expectation of the least-squares error between the forward-difference
approximation, $\frac{\tilde f(x + \mu u; \xi_1) - \tilde f(x; \xi_2)}{\mu}$, and the directional derivative of the smooth function,
$\langle \nabla f(x), u \rangle$. The authors show that a near-optimal $\mu$ can be computed in such a way that
the expected error has the tightest upper bound among all such values $\mu$. Inspired by their
approach, we consider the least-squares error between $\frac{\tilde f(x + \mu u; \xi_1) - \tilde f(x; \xi_2)}{\mu} u$ and $\langle \nabla f(x), u \rangle u$.
That is, our goal is to find $\mu^*$ that minimizes an upper bound on $\mathbb{E}[\mathcal{E}(\mu)]$, where

$$\mathcal{E}(\mu) \equiv \mathcal{E}(\mu; x, u, \xi_1, \xi_2) = \left\| \frac{\tilde f(x + \mu u; \xi_1) - \tilde f(x; \xi_2)}{\mu}\, u - \langle \nabla f(x), u \rangle u \right\|^2.$$

We recall that $u$, $\xi_1$, and $\xi_2$ are independent random variables.

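Theorem 4.3. Let Assumptions 3.1, 4.1, and 4.2 hold. If a smoothing stepsize is chosen
as

$$\mu^* = \left[ \frac{8 \sigma_a^2 n}{L_1^2 (n + 6)^3} \right]^{1/4}, \tag{4.2}$$

then for any $x \in \mathbb{R}^n$, we have

$$\mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\mu^*)] \le \sqrt{2}\, L_1 \sigma_a \sqrt{n (n + 6)^3}. \tag{4.3}$$

Proof. Using (4.1) and (2.5), we derive

$$\mathcal{E}(\mu) \le \left\| \frac{\nu(x + \mu u; \xi_1) - \nu(x; \xi_2)}{\mu}\, u + \frac{\mu L_1}{2} \|u\|^2 u \right\|^2 \le \left( \frac{\nu(x + \mu u; \xi_1) - \nu(x; \xi_2)}{\mu} + \frac{\mu L_1}{2} \|u\|^2 \right)^2 \|u\|^2.$$

Let $X = \frac{\nu(x + \mu u; \xi_1) - \nu(x; \xi_2)}{\mu} + \frac{\mu L_1}{2} \|u\|^2$. By Assumption 4.2, the expectation of $X$ with respect
to $\xi_1$ and $\xi_2$ is $\mathbb{E}_{\xi_1, \xi_2}[X] = \frac{\mu L_1}{2} \|u\|^2$, and the corresponding variance is $\mathrm{Var}(X) = \frac{2 \sigma_a^2}{\mu^2}$. It
then follows that

$$\mathbb{E}_{\xi_1, \xi_2}[X^2] = \big(\mathbb{E}_{\xi_1, \xi_2}[X]\big)^2 + \mathrm{Var}(X) = \frac{\mu^2 L_1^2}{4} \|u\|^4 + \frac{2 \sigma_a^2}{\mu^2}.$$

Hence, taking the expectation of $\mathcal{E}(\mu)$ with respect to $u$, $\xi_1$, and $\xi_2$ yields

$$\mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\mu)] \le \mathbb{E}_u\big[ \|u\|^2\, \mathbb{E}_{\xi_1, \xi_2}[X^2] \big] = \mathbb{E}_u\left[ \frac{\mu^2 L_1^2}{4} \|u\|^6 + \frac{2 \sigma_a^2}{\mu^2} \|u\|^2 \right].$$

Using (2.9) and (2.10), we can further derive

$$\mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\mu)] \le \frac{\mu^2 L_1^2}{4} (n + 6)^3 + \frac{2 \sigma_a^2}{\mu^2} n. \tag{4.4}$$

The right-hand side of (4.4) is uniformly convex in $\mu$ and has a global minimizer of

$$\mu^* = \left[ \frac{8 \sigma_a^2 n}{L_1^2 (n + 6)^3} \right]^{1/4},$$

with the corresponding minimum value yielding (4.3). □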


Remarks:

• A key observation is that for a function $\tilde f(x; \xi)$ with additive noise, as long as the noise
  has a constant variance $\sigma_a^2 > 0$, the optimal choice of the stepsize $\mu^*$ is independent
  of $x$.

• Since the proof of Theorem 4.3 does not rely on the convexity assumption about $f$, the
  error bound (4.3) for the finite-difference approximation also holds for the nonconvex
  case. The convergence rate analysis for STARS presented in the next section, however,
  will assume convexity of $f$; the nonconvex case is out of the scope of this paper but
  is of interest for future research.
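The stepsize of Theorem 4.3 can be checked numerically. The sketch below evaluates the upper bound $\frac{\mu^2 L_1^2}{4}(n+6)^3 + \frac{2\sigma_a^2}{\mu^2} n$ from (4.4) at $\mu^* = [8\sigma_a^2 n / (L_1^2 (n+6)^3)]^{1/4}$ and compares it with the closed-form value $\sqrt{2} L_1 \sigma_a \sqrt{n(n+6)^3}$ from (4.3). The function names and the sample values of $\sigma_a$, $L_1$, and $n$ are assumptions for illustration.

```python
import math

def mu_star_additive(sigma_a, L1, n):
    """Smoothing stepsize (4.2) for additive noise with standard deviation sigma_a."""
    return (8.0 * sigma_a**2 * n / (L1**2 * (n + 6) ** 3)) ** 0.25

def error_bound_additive(mu, sigma_a, L1, n):
    """Right-hand side of (4.4): truncation term plus noise term."""
    return 0.25 * L1**2 * mu**2 * (n + 6) ** 3 + 2.0 * sigma_a**2 * n / mu**2

sigma_a, L1, n = 1e-3, 2.0, 10          # illustrative values only
mu_opt = mu_star_additive(sigma_a, L1, n)
best = error_bound_additive(mu_opt, sigma_a, L1, n)
closed_form = math.sqrt(2.0) * L1 * sigma_a * math.sqrt(n * (n + 6) ** 3)
```

The two terms of the bound pull in opposite directions: a larger $\mu$ inflates the truncation term, a smaller $\mu$ amplifies the noise term, and at $\mu^*$ the two are equal. Note that nothing in `mu_star_additive` depends on the point $x$.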
4.2 Convergence Rate Analysis


We now examine the convergence rate of Algorithm 1 applied to the additive noise case
of (4.1) and with $\mu_k = \mu^*$ for all $k$. One of the main ideas behind this convergence proof
relies on the fact that we can derive the improvement in $f$ achieved by each step in terms
of the change in $x$. Since the distance between the starting point and the optimal solution,
denoted by $R = \|x_0 - x^*\|$, is finite, one can derive an upper bound for the "accumulative
improvement in $f$," $\frac{1}{N+1} \sum_{k=0}^{N} \big(\mathbb{E}[f(x_k)] - f^*\big)$. Hence, we can show that increasing the
number of iterations, $N$, of Algorithm 1 yields higher accuracy in the solution.

For simplicity, we denote by $\mathbb{E}[\cdot]$ the expectation over all random variables (i.e., $\mathbb{E}[\cdot] =
\mathbb{E}_{u_k, \xi_k, \xi_{k-1}}[\cdot]$) unless otherwise specified. Similarly, we denote $s_{\mu_k}(x_k; u_k, \xi_k, \xi_{k-1})$ in
(3.1) by $s_{\mu_k}$. The following lemma directly follows from Theorem 4.3.

Lemma 4.4. Let Assumptions 3.1, 4.1, and 4.2 hold. If the smoothing stepsize $\mu_k$ is set to
the constant $\mu^*$ from (4.2), then Algorithm 1 generates steps satisfying

$$\mathbb{E}[\|s_{\mu_k}\|^2] \le 2(n + 4) \|\nabla f(x_k)\|^2 + C_2,$$

where $C_2 = 2\sqrt{2}\, L_1 \sigma_a \sqrt{n (n + 6)^3}$.

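Proof. Let $g_0(x_k) = \langle \nabla f(x_k), u_k \rangle u_k$. Then (4.3) implies that

$$\mathbb{E}\big[ \|s_{\mu_k}\|^2 - 2 \langle s_{\mu_k}, g_0(x_k) \rangle + \|g_0(x_k)\|^2 \big] \le C_1, \tag{4.5}$$

where $C_1 = \sqrt{2}\, L_1 \sigma_a \sqrt{n (n + 6)^3}$.

The stochastic gradient-free oracle $s_{\mu_k}$ in (3.1) is a random approximation of the gradient
$\nabla f(x_k)$. Furthermore, the expectation of $s_{\mu_k}$ with respect to $\xi_k$ and $\xi_{k-1}$ yields the forward-
difference approximation of the derivative of $f$ in the direction $u_k$ at $x_k$:

$$\mathbb{E}_{\xi_k, \xi_{k-1}}[s_{\mu_k}] = \frac{f(x_k + \mu_k u_k) - f(x_k)}{\mu_k}\, u_k = g_\mu(x_k). \tag{4.6}$$

Combining (4.5) and (4.6) yields

$$\begin{aligned}
\mathbb{E}[\|s_{\mu_k}\|^2] &\overset{(4.5)}{\le} \mathbb{E}\big[ 2 \langle s_{\mu_k}, g_0(x_k) \rangle - \|g_0(x_k)\|^2 \big] + C_1 \\
&\overset{(4.6)}{=} \mathbb{E}_{u_k}\big[ 2 \langle g_\mu(x_k), g_0(x_k) \rangle - \|g_0(x_k)\|^2 \big] + C_1 \\
&= \mathbb{E}_{u_k}\big[ -\|g_0(x_k) - g_\mu(x_k)\|^2 + \|g_\mu(x_k)\|^2 \big] + C_1 \\
&\le \mathbb{E}_{u_k}\big[ \|g_\mu(x_k)\|^2 \big] + C_1 \\
&\overset{(2.14)}{\le} 2(n + 4) \|\nabla f(x_k)\|^2 + C_2,
\end{aligned}$$

where $C_2 = C_1 + \frac{(\mu^*)^2}{2} L_1^2 (n + 6)^3 = 2\sqrt{2}\, L_1 \sigma_a \sqrt{n (n + 6)^3}$. □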


We are now ready to show convergence of the algorithm. Denote by $x^* \in \mathbb{R}^n$ a minimizer
associated with $f^* = f(x^*)$. Denote by $\mathcal{U}_k = \{u_1, \dots, u_k\}$ the set of i.i.d. random variable
realizations attached to each iteration of Algorithm 1. Similarly, let $\mathcal{P}_k = \{\xi_0, \dots, \xi_k\}$.
Define $\phi_0 = f(x_0)$ and $\phi_k = \mathbb{E}_{\mathcal{U}_{k-1}, \mathcal{P}_{k-1}}[f(x_k)]$ for $k \ge 1$.

Theorem 4.5. Let Assumptions 3.1, 4.1, and 4.2 hold. Let the sequence $\{x_k\}_{k \ge 0}$ be
generated by Algorithm 1 with the smoothing stepsize $\mu_k$ set as $\mu^*$ in (4.2). If the fixed step
length is $h_k = h = \frac{1}{4 L_1 (n + 4)}$ for all $k$, then for any $N \ge 0$, we have

$$\frac{1}{N + 1} \sum_{k=0}^{N} (\phi_k - f^*) \le \frac{4 L_1 (n + 4)}{N + 1} \|x_0 - x^*\|^2 + \frac{3\sqrt{2}}{5} \sigma_a (n + 4).$$
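Proof. We start with deriving the expectation of the change in $x$ of each step, that is,
$\mathbb{E}[r_{k+1}^2] - r_k^2$, where $r_k = \|x_k - x^*\|$. First,

$$r_{k+1}^2 = \|x_k - h_k s_{\mu_k} - x^*\|^2 = r_k^2 - 2 h_k \langle s_{\mu_k}, x_k - x^* \rangle + h_k^2 \|s_{\mu_k}\|^2.$$

$\mathbb{E}[s_{\mu_k}]$ can be derived by using (2.13) and (4.6). $\mathbb{E}[\|s_{\mu_k}\|^2]$ is derived in Lemma 4.4. Hence,

$$\mathbb{E}[r_{k+1}^2] \le r_k^2 - 2 h_k \langle \nabla f_\mu(x_k), x_k - x^* \rangle + h_k^2 \big[ 2(n + 4) \|\nabla f(x_k)\|^2 + C_2 \big].$$

By using (2.7), (2.11), and (2.6), we derive

$$\mathbb{E}[r_{k+1}^2] \le r_k^2 - 2 h_k \big(f(x_k) - f_\mu(x^*)\big) + 4 h_k^2 L_1 (n + 4) \big(f(x_k) - f(x^*)\big) + h_k^2 C_2.$$

Combining this expression with (2.12), which bounds the error between $f_\mu(x)$ and $f(x)$, we
obtain

$$\mathbb{E}[r_{k+1}^2] \le r_k^2 - 2 h_k \big(1 - 2 h_k L_1 (n + 4)\big) \big(f(x_k) - f^*\big) + C_3,$$

where $C_3 = h_k^2 C_2 + 2 h_k \frac{\mu^2}{2} L_1 n = h_k^2 C_2 + 2\sqrt{2}\, h_k \sigma_a \sqrt{\frac{n^3}{(n + 6)^3}}$.
Let $h_k = h = \frac{1}{4 L_1 (n + 4)}$. Then,

$$\mathbb{E}[r_{k+1}^2] \le r_k^2 - \frac{f(x_k) - f^*}{4 L_1 (n + 4)} + C_3, \tag{4.7}$$

where $C_3 = \frac{\sqrt{2}\, \sigma_a}{2 L_1}\, g_1(n)$ and $g_1(n) = \frac{\sqrt{n (n + 6)^3}}{4 (n + 4)^2} + \frac{1}{n + 4} \sqrt{\frac{n^3}{(n + 6)^3}}$. By showing that $g_1'(n) < 0$
for all $n \ge 10$ and $g_1'(n) > 0$ for all $n \le 9$, we can prove that $g_1(n) \le \max\{g_1(9), g_1(10)\} =
\max\{0.2936, 0.2934\} \le 0.3$. Hence, $C_3 \le \frac{3\sqrt{2}\, \sigma_a}{20 L_1}$.

Taking the expectation in $\mathcal{U}_k$ and $\mathcal{P}_k$, we have

$$\mathbb{E}_{\mathcal{U}_k, \mathcal{P}_k}[r_{k+1}^2] \le \mathbb{E}_{\mathcal{U}_{k-1}, \mathcal{P}_{k-1}}[r_k^2] - \frac{\phi_k - f^*}{4 L_1 (n + 4)} + \frac{3\sqrt{2}\, \sigma_a}{20 L_1}.$$

Summing these inequalities over $k = 0, \dots, N$ and dividing by $N + 1$, we obtain the desired
result. □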
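The numerical claim about $g_1$ in the proof can be checked directly. The expression used below, $g_1(n) = \frac{\sqrt{n(n+6)^3}}{4(n+4)^2} + \frac{1}{n+4}\sqrt{\frac{n^3}{(n+6)^3}}$, is a reconstruction obtained by substituting $h = \frac{1}{4L_1(n+4)}$ and $\mu^*$ into $C_3$; it should be read as an assumption, although it reproduces the quoted values 0.2936 and 0.2934 at $n = 9$ and $n = 10$.

```python
import math

def g1(n):
    """Reconstructed g_1(n) such that C_3 = sqrt(2) * sigma_a * g1(n) / (2 * L_1)."""
    first = math.sqrt(n * (n + 6) ** 3) / (4 * (n + 4) ** 2)
    second = math.sqrt(n**3 / (n + 6) ** 3) / (n + 4)
    return first + second

# Scan a range of integer dimensions and locate the largest value.
values = {n: g1(n) for n in range(1, 10001)}
n_max = max(values, key=values.get)
g1_max = values[n_max]
```

A scan over integer dimensions is enough here because $n$ is a problem dimension; the uniform bound $g_1(n) \le 0.3$ is what turns $C_3$ into the dimension-free constant $\frac{3\sqrt{2}\sigma_a}{20 L_1}$.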

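The bound in Theorem 4.5 is valid also for $\hat\phi_N = \mathbb{E}_{\mathcal{U}_{N-1}, \mathcal{P}_{N-1}}[f(\hat x_N)]$, where $\hat x_N =
\arg\min_x \{f(x) : x \in \{x_0, \dots, x_N\}\}$. In this case,

$$\begin{aligned}
\mathbb{E}_{\mathcal{U}_{N-1}, \mathcal{P}_{N-1}}[f(\hat x_N)] - f^* &\le \mathbb{E}_{\mathcal{U}_{N-1}, \mathcal{P}_{N-1}}\left[ \frac{1}{N + 1} \sum_{k=0}^{N} \big(f(x_k) - f^*\big) \right] \\
&\le \frac{4 L_1 (n + 4)}{N + 1} \|x_0 - x^*\|^2 + \frac{3\sqrt{2}}{5} \sigma_a (n + 4).
\end{aligned}$$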
Hence, in order to achieve a final accuracy of $\epsilon$ for $\hat\phi_N$ (that is, $\hat\phi_N - f^* \le \epsilon$), the allowable
absolute noise in the objective function has to satisfy $\sigma_a \le \frac{5 \epsilon}{6\sqrt{2}\,(n + 4)}$. Furthermore, under
this bound on the allowable noise, this $\epsilon$ accuracy can be ensured by STARS in

$$N = \frac{8 (n + 4) L_1 R^2}{\epsilon} - 1 \sim O\left( \frac{n}{\epsilon} L_1 R^2 \right) \tag{4.8}$$

iterations, where $R^2$ is an upper bound on the squared Euclidean distance between the
starting point and the optimal solution: $\|x_0 - x^*\|^2 \le R^2$. In other words, given an
optimization problem that has bounded absolute noise of variance $\sigma_a^2$, the best accuracy
that can be ensured by STARS is

$$\epsilon_{pred} \ge \frac{6\sqrt{2}\, \sigma_a (n + 4)}{5}, \tag{4.9}$$

and we can solve this noisy problem in $O\left( \frac{n}{\epsilon_{pred}} L_1 R^2 \right)$ iterations. Unsurprisingly, a price
must be paid for having access only to noisy realizations, and this price is that arbitrary
accuracy cannot be reached in the noisy setting.
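The three quantities discussed above (allowable noise, iteration count, and best guaranteed accuracy) are simple enough to evaluate directly. The sketch below uses $\sigma_a \le \frac{5\epsilon}{6\sqrt{2}(n+4)}$, $N = \frac{8(n+4)L_1R^2}{\epsilon} - 1$ from (4.8), and $\epsilon_{pred} = \frac{6\sqrt{2}\sigma_a(n+4)}{5}$ from (4.9); the function names and sample inputs are hypothetical.

```python
import math

def max_noise_std(eps, n):
    """Largest sigma_a for which accuracy eps is still covered by the bound."""
    return 5.0 * eps / (6.0 * math.sqrt(2.0) * (n + 4))

def iterations_needed(eps, n, L1, R2):
    """Iteration count from (4.8), rounded up to an integer."""
    return math.ceil(8.0 * (n + 4) * L1 * R2 / eps - 1.0)

def best_accuracy(sigma_a, n):
    """Smallest accuracy the bound can guarantee for noise level sigma_a, as in (4.9)."""
    return 6.0 * math.sqrt(2.0) * sigma_a * (n + 4) / 5.0

# Illustrative values: dimension 6, L_1 = 1, R^2 = 1, target accuracy 0.5.
n_iters = iterations_needed(0.5, 6, 1.0, 1.0)
sigma_limit = max_noise_std(0.5, 6)
```

The functions `max_noise_std` and `best_accuracy` are inverses of each other for fixed $n$, which reflects that (4.9) is just the noise condition solved for $\epsilon$. Both scale linearly with $n + 4$, so the tolerable noise level shrinks as the dimension grows.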


5 Multiplicative Noise 

A multiplicative noise model is described by

$$\tilde f(x; \xi) = f(x) [1 + \nu(x; \xi)] = f(x) + f(x) \nu(x; \xi). \tag{5.1}$$

In practice, $|\nu|$ is bounded by something smaller (often much smaller) than 1. A canonical
example is when $f$ corresponds to a Monte Carlo integration, with a stopping criterion
based on the value $f(x)$. Similarly, if $f$ is simple and computed in double precision, the
relative errors are roughly $10^{-16}$; in single precision, the errors are roughly $10^{-8}$ and in half
precision we get errors of roughly $10^{-4}$.

Formally, we make the following assumptions in our analysis of STARS for the problem
(1.1) with multiplicative noise.

Assumption 5.1 (Assumption about $f$). $f$ is continuously differentiable and convex and
has Lipschitz constant $L_0$. $\nabla f$ has Lipschitz constant $L_1$.

Assumption 5.2 (Assumption about multiplicative $\nu$).

1. $\nu$ is i.i.d., with zero mean and bounded variance; that is, $\mathbb{E}[\nu] = 0$, $\sigma_r^2 = \mathrm{Var}(\nu) > 0$.

2. The expectation of the signal-to-noise ratio is bounded; that is, $\mathbb{E}\left[\frac{1}{1 + \nu}\right] \le b$.

3. The support of $\nu$ (i.e., the range of values that $\nu$ can take with positive probability) is
   bounded by $\pm a$, where $a < 1$.

The first part of Assumption 5.2 is analogous to that in Assumption 4.2 and guarantees
that the distribution of $\nu$ is independent of $x$. Although not specifying a distributional form
for $\nu$ (with respect to $\xi$), the final two parts of Assumption 5.2 are made to simplify the
presentation and rule out cases where the noise completely corrupts the function.
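5.1 Noise and Finite Differences


Analogous to Theorem 4.3, Theorem 5.3 shows how to compute the near-optimal stepsizes
in the multiplicative noise setting.

Theorem 5.3. Let Assumptions 5.1 and 5.2 hold. If a forward-difference parameter is
chosen as

$$\mu^* = C_4 \sqrt{|f(x)|}, \quad \text{where } C_4 = \left[ \frac{16 \sigma_r^2 n}{L_1^2 (1 + 3 \sigma_r^2) (n + 6)^3} \right]^{1/4},$$

then for any $x \in \mathbb{R}^n$ we have

$$\mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\mu^*)] \le 2 L_1 \sigma_r \sqrt{(1 + 3 \sigma_r^2)\, n (n + 6)^3}\; |f(x)| + 3 L_0^2 \sigma_r^2 (n + 4)^2. \tag{5.2}$$

Proof. By using (5.1) and (2.5), we derive

$$\begin{aligned}
\mathcal{E}(\mu) &\le \left\| \frac{f(x + \mu u) \nu(x + \mu u; \xi_1) - f(x) \nu(x; \xi_2)}{\mu}\, u + \frac{\mu L_1}{2} \|u\|^2 u \right\|^2 \\
&\le \left( \frac{f(x + \mu u) \nu(x + \mu u; \xi_1) - f(x) \nu(x; \xi_2)}{\mu} + \frac{\mu L_1}{2} \|u\|^2 \right)^2 \|u\|^2.
\end{aligned}$$

Again applying (2.5), we get $\mathcal{E}(\mu) \le X^2 \|u\|^2$, where

$$\begin{aligned}
X &= \frac{f(x + \mu u) \nu(x + \mu u; \xi_1) - f(x) \nu(x; \xi_2)}{\mu} + \frac{\mu L_1}{2} \|u\|^2 \\
&\le \left( \frac{f(x)}{\mu} + \nabla f(x)^T u + \frac{\mu L_1}{2} \|u\|^2 \right) \nu(x + \mu u; \xi_1) - \frac{f(x)}{\mu} \nu(x; \xi_2) + \frac{\mu L_1}{2} \|u\|^2.
\end{aligned}$$

The expectation of $X$ with respect to $\xi_1$ and $\xi_2$ is

$$\mathbb{E}_{\xi_1, \xi_2}[X] = \frac{\mu L_1}{2} \|u\|^2,$$

and the corresponding variance is

$$\begin{aligned}
\mathrm{Var}(X) &= \left[ \left( \frac{f(x)}{\mu} + \nabla f(x)^T u + \frac{\mu L_1}{2} \|u\|^2 \right)^2 + \frac{f^2(x)}{\mu^2} \right] \sigma_r^2 \\
&\le \left[ \frac{3 f^2(x)}{\mu^2} + 3 (\nabla f(x)^T u)^2 + \frac{3 \mu^2 L_1^2}{4} \|u\|^4 + \frac{f^2(x)}{\mu^2} \right] \sigma_r^2 \\
&= \left[ \frac{4 f^2(x)}{\mu^2} + 3 (\nabla f(x)^T u)^2 + \frac{3 \mu^2 L_1^2}{4} \|u\|^4 \right] \sigma_r^2,
\end{aligned}$$

where the inequality holds because $(a + b + c)^2 \le 3a^2 + 3b^2 + 3c^2$ for any $a, b, c$. Since
$\mathbb{E}[X^2] = \mathrm{Var}(X) + (\mathbb{E}[X])^2$, we have that

$$\begin{aligned}
\mathbb{E}_{\xi_1, \xi_2}[X^2] &\le \frac{\mu^2 L_1^2 (1 + 3 \sigma_r^2)}{4} \|u\|^4 + \frac{4 \sigma_r^2}{\mu^2} f^2(x) + 3 \sigma_r^2 (\nabla f(x)^T u)^2 \\
&\le \frac{\mu^2 L_1^2 (1 + 3 \sigma_r^2)}{4} \|u\|^4 + \frac{4 \sigma_r^2}{\mu^2} f^2(x) + 3 L_0^2 \sigma_r^2 \|u\|^2.
\end{aligned}$$

Hence, we can derive

$$\begin{aligned}
\mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\mu)] &\le \mathbb{E}_u\big[ \mathbb{E}_{\xi_1, \xi_2}[X^2 \|u\|^2] \big] \\
&= \mathbb{E}_u\big[ \|u\|^2\, \mathbb{E}_{\xi_1, \xi_2}[X^2] \big] \\
&\le \mathbb{E}_u\left[ \frac{\mu^2 L_1^2 (1 + 3 \sigma_r^2)}{4} \|u\|^6 + \frac{4 \sigma_r^2}{\mu^2} f^2(x) \|u\|^2 + 3 L_0^2 \sigma_r^2 \|u\|^4 \right].
\end{aligned}$$

By using (2.9), (2.10), and this last expression, we get

$$\mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\mu)] \le \frac{\mu^2 L_1^2 (1 + 3 \sigma_r^2)}{4} (n + 6)^3 + \frac{4 \sigma_r^2}{\mu^2}\, n f^2(x) + 3 L_0^2 \sigma_r^2 (n + 4)^2.$$

The right-hand side of this expression is uniformly convex in $\mu$ and attains its global
minimum at $\mu^* = C_4 \sqrt{|f(x)|}$; the corresponding expectation of the least-squares error is

$$\mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\mu^*)] \le 2 L_1 \sigma_r \sqrt{(1 + 3 \sigma_r^2)\, n (n + 6)^3}\; |f(x)| + 3 L_0^2 \sigma_r^2 (n + 4)^2.$$

□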
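As in the additive case, the stepsize of Theorem 5.3 can be sanity-checked numerically. The sketch evaluates the upper bound $\frac{\mu^2 L_1^2(1+3\sigma_r^2)}{4}(n+6)^3 + \frac{4\sigma_r^2}{\mu^2} n f^2(x) + 3L_0^2\sigma_r^2(n+4)^2$ at $\mu^* = C_4\sqrt{|f(x)|}$ and compares it with the right-hand side of (5.2). The function names and the sample parameter values are assumptions for illustration.

```python
import math

def c4(sigma_r, L1, n):
    """Constant C_4 of Theorem 5.3."""
    return (16.0 * sigma_r**2 * n
            / (L1**2 * (1.0 + 3.0 * sigma_r**2) * (n + 6) ** 3)) ** 0.25

def error_bound_mult(mu, fx, sigma_r, L0, L1, n):
    """Upper bound on the expected least-squares error as a function of mu."""
    truncation = 0.25 * mu**2 * L1**2 * (1.0 + 3.0 * sigma_r**2) * (n + 6) ** 3
    noise = 4.0 * sigma_r**2 * n * fx**2 / mu**2
    floor = 3.0 * L0**2 * sigma_r**2 * (n + 4) ** 2   # does not depend on mu
    return truncation + noise + floor

sigma_r, L0, L1, n, fx = 1e-2, 1.0, 2.0, 10, 3.0    # illustrative values only
mu_opt = c4(sigma_r, L1, n) * math.sqrt(abs(fx))
best = error_bound_mult(mu_opt, fx, sigma_r, L0, L1, n)
closed_form = (2.0 * L1 * sigma_r
               * math.sqrt((1.0 + 3.0 * sigma_r**2) * n * (n + 6) ** 3) * abs(fx)
               + 3.0 * L0**2 * sigma_r**2 * (n + 4) ** 2)
```

Two differences from the additive case are visible in the code: the optimal stepsize now scales with $\sqrt{|f(x)|}$, and the bound contains a term that no choice of $\mu$ can reduce.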

Unlike for the absolute noise case of Section 4, the optimal $\mu$ value in Theorem 5.3
is not independent of $x$. Furthermore, letting $\mu_k = \mu^* = C_4 \sqrt{|f(x)|}$ assumes that $f$ is
known. Unfortunately, we have access to $f$ only through $\tilde f$. However, we can compute an
estimate, $\tilde\mu$, of $\mu^*$ by substituting $f$ with $\tilde f$ and still derive an error bound. To simplify
the derivations, we introduce another random variable, $\xi_3$, independent of $\xi_1$ and $\xi_2$, to
compute $\tilde\mu = \tilde\mu(x; \xi_3)$. The goal is to obtain an upper bound on $\mathbb{E}_{\xi_3}\big[\mathbb{E}_{\xi_1, \xi_2, u}[\mathcal{E}(\tilde\mu)]\big]$, where

$$\mathcal{E}(\tilde\mu) \equiv \mathcal{E}(\tilde\mu; x, u, \xi_1, \xi_2, \xi_3) = \left\| \frac{\tilde f(x + \tilde\mu u; \xi_1) - \tilde f(x; \xi_2)}{\tilde\mu}\, u - \langle \nabla f(x), u \rangle u \right\|^2.$$

This then allows us to proceed with the usual derivations while requiring only an additional
expectation over $\xi_3$.

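Lemma 5.4. Let Assumptions 5.1 and 5.2 hold. If a forward-difference parameter is chosen
as

$$\tilde\mu = C_4 \sqrt{|\tilde f(x; \xi_3)|}, \quad \text{where } C_4 = \left[ \frac{16 \sigma_r^2 n}{L_1^2 (1 + 3 \sigma_r^2) (n + 6)^3} \right]^{1/4}, \tag{5.3}$$

then for any $x \in \mathbb{R}^n$, we have

$$\mathbb{E}_{u, \xi_1, \xi_2, \xi_3}[\mathcal{E}(\tilde\mu)] \le (1 + b) L_1 \sigma_r \sqrt{(1 + 3 \sigma_r^2)\, n (n + 6)^3}\; |f(x)| + 3 L_0^2 \sigma_r^2 (n + 4)^2. \tag{5.4}$$

Proof.

$$\begin{aligned}
\mathbb{E}_{u, \xi_1, \xi_2, \xi_3}[\mathcal{E}(\tilde\mu)] &= \mathbb{E}_{\xi_3}\big[ \mathbb{E}_{u, \xi_1, \xi_2}[\mathcal{E}(\tilde\mu)] \big] \\
&\le \mathbb{E}_{\xi_3}\left[ \frac{\tilde\mu^2 L_1^2 (1 + 3 \sigma_r^2)}{4} (n + 6)^3 + \frac{4 \sigma_r^2}{\tilde\mu^2}\, n f^2(x) + 3 L_0^2 \sigma_r^2 (n + 4)^2 \right] \\
&= L_1 \sigma_r \sqrt{(1 + 3 \sigma_r^2)\, n (n + 6)^3}\; |f(x)|\; \mathbb{E}_{\xi_3}\left[ 1 + \nu(x; \xi_3) + \frac{1}{1 + \nu(x; \xi_3)} \right] + 3 L_0^2 \sigma_r^2 (n + 4)^2 \\
&\le (1 + b) L_1 \sigma_r \sqrt{(1 + 3 \sigma_r^2)\, n (n + 6)^3}\; |f(x)| + 3 L_0^2 \sigma_r^2 (n + 4)^2,
\end{aligned}$$

where the last inequality holds by Assumption 5.2 because the expectation of the signal-to-
noise ratio is bounded by $b$. □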
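To give a feel for the constant $b$ bounding $\mathbb{E}[1/(1+\nu)]$, the sketch below computes that expectation for one specific noise distribution. The choice $\nu \sim \mathrm{Uniform}(-a, a)$ is purely an assumption for illustration (the analysis does not fix a distribution); for it, direct integration gives $\mathbb{E}[1/(1+\nu)] = \frac{1}{2a}\ln\frac{1+a}{1-a}$. The helper `mu_tilde` mirrors the noisy stepsize $C_4\sqrt{|\tilde f(x;\xi_3)|}$; both names are hypothetical.

```python
import math
import random

def snr_expectation_uniform(a):
    """E[1/(1+nu)] for nu ~ Uniform(-a, a) with 0 < a < 1 (closed form)."""
    return math.log((1.0 + a) / (1.0 - a)) / (2.0 * a)

def mu_tilde(noisy_value, c4_value):
    """Noisy smoothing stepsize: C_4 * sqrt(|noisy function value|)."""
    return c4_value * math.sqrt(abs(noisy_value))

a = 0.5
b = snr_expectation_uniform(a)

# Monte Carlo cross-check of the closed form.
rng = random.Random(0)
samples = 200000
mc = sum(1.0 / (1.0 + rng.uniform(-a, a)) for _ in range(samples)) / samples
```

For small noise supports the value stays very close to 1, so the factor $(1 + b)$ in (5.4) is then only marginally larger than the factor 2 obtained in (5.2) with the exact function value.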
Remark: Similar to the additive noise case, Theorem 5.3 and Lemma 5.4 do not require
$f$ to be convex. Hence, (5.2) and (5.4) both hold in the nonconvex case. However, the
following convergence rate analysis applies only to the convex case, since Lemma 5.6 relies
on a convexity assumption for $f$.


5.2 Convergence Rate Analysis 

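Let $\mu_k = \tilde\mu = C_4 \sqrt{|\tilde f(x_k; \xi_{k'})|}$ in Algorithm 1. Before showing the convergence result, we
derive $\mathbb{E}[\langle s_{\tilde\mu}, x_k - x^* \rangle]$ and $\mathbb{E}[\|s_{\tilde\mu}\|^2]$, where $s_{\tilde\mu}$ denotes $s_{\tilde\mu}(x_k; u_k, \xi_k, \xi_{k-1}, \xi_{k'})$ and $\mathbb{E}[\cdot]$ denotes
the expectation over all random variables $u_k$, $\xi_k$, $\xi_{k-1}$, and $\xi_{k'}$ (i.e., $\mathbb{E}[\cdot] = \mathbb{E}_{u_k, \xi_k, \xi_{k-1}, \xi_{k'}}[\cdot]$),
unless otherwise specified.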

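Lemma 5.5. Let Assumptions 5.1 and 5.2 hold. If $\mu_k = \tilde\mu = C_4 \sqrt{|\tilde f(x_k; \xi_{k'})|}$, then

$$\mathbb{E}[\|s_{\tilde\mu}\|^2] \le 2(n + 4) \|\nabla f(x_k)\|^2 + C_5 |f(x_k)| + C_6,$$

where $C_5 = \frac{1}{2} C_4^2 L_1^2 (n + 6)^3 + (1 + b) L_1 \sigma_r \sqrt{(1 + 3 \sigma_r^2)\, n (n + 6)^3}$ and $C_6 = 3 L_0^2 \sigma_r^2 (n + 4)^2$.

Proof. Let $g_0(x_k) = \langle \nabla f(x_k), u_k \rangle u_k$. The bound (5.4) in Lemma 5.4 implies that

$$\mathbb{E}\big[ \|s_{\tilde\mu} - g_0(x_k)\|^2 \big] \le (1 + b) L_1 \sigma_r \sqrt{(1 + 3 \sigma_r^2)\, n (n + 6)^3}\; |f(x_k)| + 3 L_0^2 \sigma_r^2 (n + 4)^2 \equiv \ell(x_k).$$

Hence,

$$\begin{aligned}
\mathbb{E}[\|s_{\tilde\mu}\|^2] &\le \mathbb{E}_{\xi_{k'}}\big[ \mathbb{E}_{u_k, \xi_k, \xi_{k-1}}[ 2 \langle s_{\tilde\mu}, g_0(x_k) \rangle - \|g_0(x_k)\|^2 ] \big] + \ell(x_k) \\
&\overset{(4.6)}{=} \mathbb{E}_{\xi_{k'}}\big[ \mathbb{E}_{u_k}[ 2 \langle g_{\tilde\mu}(x_k), g_0(x_k) \rangle - \|g_0(x_k)\|^2 ] \big] + \ell(x_k) \\
&\le \mathbb{E}_{\xi_{k'}}\big[ \mathbb{E}_{u_k}[ \|g_{\tilde\mu}(x_k)\|^2 ] \big] + \ell(x_k) \\
&\overset{(2.14)}{\le} 2(n + 4) \|\nabla f(x_k)\|^2 + \mathbb{E}_{\xi_{k'}}\left[ \frac{\tilde\mu^2}{2} L_1^2 (n + 6)^3 \right] + \ell(x_k) \\
&= 2(n + 4) \|\nabla f(x_k)\|^2 + C_5 |f(x_k)| + C_6,
\end{aligned}$$

where the last equality holds since $\mathbb{E}_{\xi_{k'}}[\tilde\mu^2] = \mathbb{E}_{\xi_{k'}}\big[ C_4^2 |f(x_k)| (1 + \nu(x_k; \xi_{k'})) \big] = C_4^2 |f(x_k)|$. □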
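Lemma 5.6. Let Assumptions 5.1 and 5.2 hold. If $\mu_k = \tilde\mu = C_4 \sqrt{|\tilde f(x_k; \xi_{k'})|}$, then

$$\mathbb{E}[\langle s_{\tilde\mu}, x_k - x^* \rangle] \ge f(x_k) - f^* - \frac{C_4^2 L_1 n}{2} |f(x_k)|.$$

Proof. First, we have

$$\begin{aligned}
\mathbb{E}_{u_k, \xi_k, \xi_{k-1}}[s_{\tilde\mu}] &= \mathbb{E}_{u_k, \xi_k, \xi_{k-1}}\left[ \frac{\tilde f(x_k + \mu_k u_k; \xi_k) - \tilde f(x_k; \xi_{k-1})}{\mu_k}\, u_k \right] \\
&= \mathbb{E}_{u_k, \xi_k, \xi_{k-1}}\left[ \frac{f(x_k + \mu_k u_k) [1 + \nu(x_k + \mu_k u_k; \xi_k)] - f(x_k) [1 + \nu(x_k; \xi_{k-1})]}{\mu_k}\, u_k \right] \\
&= \mathbb{E}_{u_k}\left[ \frac{f(x_k + \mu_k u_k) - f(x_k)}{\mu_k}\, u_k \right] \\
&\overset{(2.13)}{=} \nabla f_{\mu_k}(x_k).
\end{aligned}$$

Then, we get

$$\begin{aligned}
\mathbb{E}_{u_k, \xi_k, \xi_{k-1}}[\langle s_{\tilde\mu}, x_k - x^* \rangle] &= \langle \nabla f_{\mu_k}(x_k), x_k - x^* \rangle \\
&\overset{(2.7)}{\ge} f_{\mu_k}(x_k) - f_{\mu_k}(x^*) \\
&\overset{(2.11)}{\ge} f(x_k) - f_{\mu_k}(x^*) \\
&\overset{(2.12)}{\ge} f(x_k) - f^* - \frac{\mu_k^2}{2} L_1 n.
\end{aligned}$$

Since $\mu_k = \tilde\mu = C_4 \sqrt{|\tilde f(x_k; \xi_{k'})|}$, we have

$$\mathbb{E}[\langle s_{\tilde\mu}, x_k - x^* \rangle] = \mathbb{E}_{\xi_{k'}}\big[ \mathbb{E}_{u_k, \xi_k, \xi_{k-1}}[\langle s_{\tilde\mu}, x_k - x^* \rangle] \big] \ge f(x_k) - f^* - \frac{C_4^2 L_1 n}{2} |f(x_k)|.$$

□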


We are now ready to show the convergence of Algorithm 1, with $\mu_k = \tilde\mu$, for the
minimization of a function (5.1) with bounded multiplicative noise.

Theorem 5.7. Let Assumptions 5.1 and 5.2 hold. Let the sequence $\{x_k\}_{k \ge 0}$ be generated
by Algorithm 1 with the smoothing parameter $\mu_k$ being

$$\mu_k = \tilde\mu = C_4 \sqrt{|\tilde f(x_k; \xi_{k'})|}$$

and the fixed step length set to $h_k = h = \frac{1}{4 L_1 (n + 4)}$ for all $k$. Let $M$ be an upper bound on
the average of the historical absolute values of noise-free function evaluations; that is,

$$M \ge \frac{1}{N + 1} \sum_{k=0}^{N} |\phi_k| = \frac{1}{N + 1} \left( |f(x_0)| + \sum_{k=1}^{N} \mathbb{E}_{\mathcal{U}_{k-1}, \mathcal{P}_{k-1}}[|f(x_k)|] \right).$$

Then, for any $N \ge 0$ we have

$$\frac{1}{N + 1} \sum_{k=0}^{N} (\phi_k - f^*) \le \frac{4 L_1 (n + 4)}{N + 1} \|x_0 - x^*\|^2 + 4 L_1 (n + 4) (C_7 M + C_8), \tag{5.5}$$

where $C_7 = \frac{C_4^2 n}{4 (n + 4)} + \frac{C_5}{16 L_1^2 (n + 4)^2}$ and $C_8 = \frac{C_6}{16 L_1^2 (n + 4)^2}$.
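Proof. Let $r_k = \|x_k - x^*\|$. First,

$$r_{k+1}^2 = \|x_k - h_k s_{\tilde\mu} - x^*\|^2 = r_k^2 - 2 h_k \langle s_{\tilde\mu}, x_k - x^* \rangle + h_k^2 \|s_{\tilde\mu}\|^2.$$

$\mathbb{E}[\langle s_{\tilde\mu}, x_k - x^* \rangle]$ and $\mathbb{E}[\|s_{\tilde\mu}\|^2]$ are derived in Lemma 5.6 and Lemma 5.5, respectively. Hence,
incorporating (2.6), we derive

$$\begin{aligned}
\mathbb{E}[r_{k+1}^2] &\le r_k^2 - 2 h_k \left( f(x_k) - f^* - \frac{C_4^2 L_1 n}{2} |f(x_k)| \right) + h_k^2 \big[ 2(n + 4) \|\nabla f(x_k)\|^2 + C_5 |f(x_k)| + C_6 \big] \\
&\le r_k^2 - 2 h_k \big(1 - 2 h_k L_1 (n + 4)\big) \big(f(x_k) - f^*\big) + \big(h_k C_4^2 L_1 n + h_k^2 C_5\big) |f(x_k)| + h_k^2 C_6.
\end{aligned}$$

Let $h_k = h = \frac{1}{4 L_1 (n + 4)}$. Then, taking the expectation with respect to $\mathcal{U}_k = \{u_1, \dots, u_k\}$ and
$\mathcal{P}_k = \{\xi_0, \xi_{0'}, \xi_1, \xi_{1'}, \dots, \xi_k, \xi_{k'}\}$ yields

$$\mathbb{E}_{\mathcal{U}_k, \mathcal{P}_k}[r_{k+1}^2] \le \mathbb{E}_{\mathcal{U}_{k-1}, \mathcal{P}_{k-1}}[r_k^2] - \frac{\phi_k - f^*}{4 L_1 (n + 4)} + C_7 |\phi_k| + C_8.$$

Summing these inequalities over $k = 0, \dots, N$ and dividing by $N + 1$, we get

$$\frac{1}{N + 1} \sum_{k=0}^{N} (\phi_k - f^*) \le \frac{4 L_1 (n + 4)}{N + 1} \|x_0 - x^*\|^2 + 4 L_1 (n + 4) (C_7 M + C_8).$$

□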


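The bound (5.5) is valid also for $\hat\phi_N = \mathbb{E}_{\mathcal{U}_{N-1}, \mathcal{P}_{N-1}}[f(\hat x_N)]$, where $\hat x_N = \arg\min_x \{f(x) :
x \in \{x_0, \dots, x_N\}\}$. In this case,

$$\begin{aligned}
\mathbb{E}_{\mathcal{U}_{N-1}, \mathcal{P}_{N-1}}[f(\hat x_N)] - f^* &\le \mathbb{E}_{\mathcal{U}_{N-1}, \mathcal{P}_{N-1}}\left[ \frac{1}{N + 1} \sum_{k=0}^{N} \big(f(x_k) - f^*\big) \right] \\
&\le \frac{4 L_1 (n + 4)}{N + 1} \|x_0 - x^*\|^2 + 4 L_1 (n + 4) (C_7 M + C_8).
\end{aligned} \tag{5.6}$$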


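Let us collect and simplify the constants $C_7$ and $C_8$. First, $C_8 = h^2C_6 = \frac{C_6}{16L_1^2(n+4)^2}$. Second, since

$$\begin{aligned}
C_5 &= \tfrac{1}{2}C_3^2L_1^2(n+6)^3 + (1+b)L_1\sigma_r\sqrt{(1+3\sigma_r^2)\,n(n+6)^3} \\
&= 2L_1\sigma_r\sqrt{\frac{1}{1+3\sigma_r^2}}\sqrt{n(n+6)^3} + (1+b)L_1\sigma_r\sqrt{(1+3\sigma_r^2)\,n(n+6)^3} \\
&\le (b+3)L_1\sigma_r\sqrt{1+3\sigma_r^2}\sqrt{n(n+6)^3},
\end{aligned}$$

where the last inequality holds because $\frac{1}{1+3\sigma_r^2} \le 1 \le 1+3\sigma_r^2$, we can derive

$$\begin{aligned}
C_7 &= \frac{C_3^2\,n}{4(n+4)} + \frac{C_5}{16L_1^2(n+4)^2} \\
&\le \frac{1}{L_1}\sqrt{\frac{\sigma_r^2}{1+3\sigma_r^2}}\,\frac{n}{n+4}\sqrt{\frac{n}{(n+6)^3}} + \frac{(b+3)\sigma_r\sqrt{1+3\sigma_r^2}\sqrt{n(n+6)^3}}{16L_1(n+4)^2} \\
&\le \frac{\sigma_r\sqrt{1+3\sigma_r^2}}{L_1}\big[g_2(n) + (b+3)g_3(n)\big],
\end{aligned}$$

where $g_2(n) = \frac{n}{n+4}\sqrt{\frac{n}{(n+6)^3}}$, $g_3(n) = \frac{\sqrt{n(n+6)^3}}{16(n+4)^2}$, and the last inequality again utilizes $\frac{1}{1+3\sigma_r^2} < 1 < 1+3\sigma_r^2$. It can be shown that $g_2'(n) < 0$ for all $n \ge 8$ and $g_2'(n) > 0$ for all $n \le 7$, thus $g_2(n) \le \max\{g_2(7), g_2(8)\} = \max\{0.0359, 0.0360\} \le \frac{3}{64}$. Similarly, one can prove that $g_3'(12) = 0$, $g_3'(n) < 0$ for all $n > 12$, and $g_3'(n) > 0$ for all $n < 12$, which indicates $g_3(n) \le g_3(12) \approx 0.0646 \le \frac{3}{32}$. Hence,

$$C_7 \le \frac{3(2b+7)\sigma_r\sqrt{1+3\sigma_r^2}}{64L_1} \le \frac{3\sqrt{3}(2b+7)\left(\sigma_r^2 + \frac{1}{6}\right)}{64L_1},$$

where the last inequality holds because $\sigma_r\sqrt{1+3\sigma_r^2} \le \sqrt{3}\left(\sigma_r^2 + \frac{1}{6}\right)$.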
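The numerical values quoted for $g_2$ and $g_3$ can be checked with a few lines of Python. This is a sketch under an assumption: the closed forms used below, $g_2(n) = \frac{n}{n+4}\sqrt{n/(n+6)^3}$ and $g_3(n) = \sqrt{n(n+6)^3}/(16(n+4)^2)$, are reconstructed from the surrounding derivation and from the quoted values 0.0359, 0.0360, and 0.0646, not copied from a clean source. The helper `argmax_over_range` is a hypothetical name introduced only for this illustration.

```python
import math


def g2(n):
    # Assumed form: g2(n) = n/(n+4) * sqrt(n / (n+6)^3)
    return n / (n + 4) * math.sqrt(n / (n + 6) ** 3)


def g3(n):
    # Assumed form: g3(n) = sqrt(n (n+6)^3) / (16 (n+4)^2)
    return math.sqrt(n * (n + 6) ** 3) / (16 * (n + 4) ** 2)


def argmax_over_range(g, n_max=10_000):
    # Brute-force search over integer dimensions 1..n_max.
    return max(range(1, n_max + 1), key=g)


if __name__ == "__main__":
    for name, g in (("g2", g2), ("g3", g3)):
        n_star = argmax_over_range(g)
        print(name, "is maximized at n =", n_star, "with value", round(g(n_star), 4))
```

Over integer dimensions the search places the maximum of $g_2$ at $n = 8$ and that of $g_3$ at $n = 12$, matching the monotonicity claims in the text.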

With $C_7$ and $C_8$ simplified, (5.6) can be used to establish an accuracy $\epsilon$ for $\hat{\phi}_N$; that is, $\hat{\phi}_N - f^* \le \epsilon$ can be achieved in $O\left(\frac{n}{\epsilon}L_1R^2\right)$ iterations, provided the variance of the relative noise $\sigma_r^2$ satisfies


4Li(n + 4) [CjM + Cg) < —Cg((j^ + — ){n + 4) < —, 


6' 


where Cg = ^^(26 + 7)M + that is, 


cr? < 


1 

6 ' 


C 9 (ra + 4) 6 ' 

The bound in (5.7) may be cause for concern since the upper bound may only be positive for larger values of $\epsilon$. Rearranging the terms explicitly shows that the additive term $\frac{1}{6}$ is a limiting factor for the best accuracy that can be ensured by this bound:


e pred - C ^ a 


\)(.n 


4). 


(5.8) 


6 Numerical Experiments 

We perform three types of numerical studies. Since our convergence rate analysis guarantees only that the means converge, we first test how much variability the performance of STARS shows from one run to another. Second, we study the convergence behavior of STARS in both the absolute noise and multiplicative noise cases and examine these results relative to the bounds established in our analysis. Third, we compare STARS with four other randomized zero-order methods to highlight what is gained by using an adaptive smoothing stepsize.

6.1 Performance Variability 

We first examine the variability of the performance of STARS relative to that of Nesterov's RG algorithm [15], which is summarized in Algorithm 2. One can observe that RG and STARS have identical algorithmic updates except for the choice of the smoothing stepsize $\mu_k$. Whereas STARS takes into account the noise level, RG calculates the smoothing stepsize based on the target accuracy $\epsilon$ in addition to the problem dimension and Lipschitz constant,

(6.1)
Algorithm 2 (RG: Random Search for Smooth Optimization)

1: Choose initial point $x_0$ and iteration limit $N$. Fix step length $h_k = h = \frac{1}{4(n+4)L_1}$ and compute smoothing stepsize $\mu_k$ based on $\epsilon = 2^{-16}$. Set $k \leftarrow 1$.

2: Generate a random Gaussian vector $u_k$.

3: Evaluate the function values $\tilde{f}(x_k; \xi_k)$ and $\tilde{f}(x_k + \mu_k u_k; \xi_k)$.

4: Call the random stochastic gradient-free oracle

$$s_\mu(x_k; u_k, \xi_k) = \frac{\tilde{f}(x_k + \mu_k u_k; \xi_k) - \tilde{f}(x_k; \xi_k)}{\mu_k}\,u_k.$$

5: Set $x_{k+1} = x_k - h_k s_\mu(x_k; u_k, \xi_k)$, update $k \leftarrow k + 1$, and return to Step 2.
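The five steps of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch, not the MATLAB implementation used in the experiments: the function name `random_search`, the callable signature `f_noisy(x, rng)`, and the choice to pass the smoothing stepsize `mu` in as a plain argument (rather than computing it from $\epsilon$) are assumptions made for the example.

```python
import numpy as np


def random_search(f_noisy, x0, L1, mu, n_iters, rng):
    """Forward-difference random search with fixed step length h = 1/(4(n+4)L1).

    f_noisy(x, rng) is assumed to return one noisy evaluation of the objective.
    """
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    h = 1.0 / (4.0 * (n + 4) * L1)      # Step 1: fixed step length
    for _ in range(n_iters):
        u = rng.standard_normal(n)       # Step 2: random Gaussian direction
        f_base = f_noisy(x, rng)         # Step 3: two noisy function values
        f_pert = f_noisy(x + mu * u, rng)
        s = (f_pert - f_base) / mu * u   # Step 4: gradient-free oracle
        x = x - h * s                    # Step 5: update the iterate
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def noisy_sphere(x, rng):
        # Hypothetical test objective: 0.5*||x||^2 plus small additive noise.
        return 0.5 * float(x @ x) + 1e-6 * rng.uniform(-1.0, 1.0)

    x_final = random_search(noisy_sphere, np.ones(4), L1=1.0, mu=1e-3,
                            n_iters=500, rng=rng)
    print("final true objective:", 0.5 * float(x_final @ x_final))
```

Only the rule that produces `mu` distinguishes RG from STARS in this template, which is why the sketch leaves it as an input.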


MATLAB implementations of both RG and STARS are tested on a smooth convex function with random noise added in both additive and multiplicative forms. In our tests, we use uniform random noise, with $\nu$ generated uniformly from the interval $[-\sqrt{3}\sigma, \sqrt{3}\sigma]$ by using MATLAB's random number generator rand. This choice ensures that $\nu$ has zero mean and bounded variance $\sigma^2$ in both the additive ($\sigma_a = \sigma$) and multiplicative ($\sigma_r = \sigma$) cases and that Assumptions 4.2 and 5.2 hold, provided that $\sigma < 3^{-1/2}$.
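The noise model just described can be sketched in Python (the experiments themselves use MATLAB's rand; this NumPy version is an illustrative translation). The function names `uniform_noise`, `additive_noise`, and `multiplicative_noise` are hypothetical, and the multiplicative form $f(x)(1 + \nu)$ is assumed from the role of the condition $\sigma < 3^{-1/2}$, which keeps $1 + \nu$ positive.

```python
import numpy as np


def uniform_noise(sigma, size, rng):
    # Uniform on [-sqrt(3)*sigma, sqrt(3)*sigma]: zero mean, variance sigma^2.
    half_width = np.sqrt(3.0) * sigma
    return rng.uniform(-half_width, half_width, size=size)


def additive_noise(f_val, nu):
    # Additive case (sigma_a = sigma): noise is added to the function value.
    return f_val + nu


def multiplicative_noise(f_val, nu):
    # Multiplicative case (sigma_r = sigma): assumed form f * (1 + nu).
    return f_val * (1.0 + nu)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nu = uniform_noise(0.1, 200_000, rng)
    print("sample mean:", nu.mean(), "sample variance:", nu.var())
```

Because the interval has width $2\sqrt{3}\sigma$, its variance is $(2\sqrt{3}\sigma)^2/12 = \sigma^2$, which is what the sample variance should approach.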

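We use Nesterov's smooth function as introduced in [15]:

$$f_1(x) = \frac{1}{2}\left(x^{(1)}\right)^2 + \frac{1}{2}\sum_{i=1}^{n-1}\left(x^{(i+1)} - x^{(i)}\right)^2 + \frac{1}{2}\left(x^{(n)}\right)^2 - x^{(1)}, \qquad (6.2)$$

where $x^{(i)}$ denotes the $i$th component of the vector $x \in \mathbb{R}^n$. The starting point specified for this problem is the vector of zeros, $x_0 = 0$. The optimal solution is

$$x^{*(i)} = 1 - \frac{i}{n+1}, \quad i = 1, \dots, n; \qquad f(x^*) = -\frac{n}{2(n+1)}.$$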
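A short Python sketch can confirm the stated minimizer and optimal value. It assumes the standard form of Nesterov's test function from [15], namely $f_1(x) = \frac{1}{2}[(x^{(1)})^2 + \sum_{i=1}^{n-1}(x^{(i+1)} - x^{(i)})^2 + (x^{(n)})^2] - x^{(1)}$; the function names are hypothetical.

```python
import numpy as np


def nesterov_f1(x):
    # 0.5 * [x_1^2 + sum_i (x_{i+1} - x_i)^2 + x_n^2] - x_1
    x = np.asarray(x, dtype=float)
    return 0.5 * (x[0] ** 2 + np.sum(np.diff(x) ** 2) + x[-1] ** 2) - x[0]


def nesterov_f1_minimizer(n):
    # Closed-form minimizer: x*_i = 1 - i/(n+1), i = 1..n.
    i = np.arange(1, n + 1)
    return 1.0 - i / (n + 1.0)


if __name__ == "__main__":
    n = 8
    x_star = nesterov_f1_minimizer(n)
    print("f1(x*)        =", nesterov_f1(x_star))
    print("-n/(2(n+1))   =", -n / (2.0 * (n + 1)))
    print("||x0 - x*||^2 =", float(x_star @ x_star))  # x0 is the zero vector
```

Since the starting point is the zero vector, the last printed quantity is the squared distance $R^2$ used in the complexity bounds.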


The analytical values for the parameters (corresponding to the Lipschitz constant for the gradient and the squared Euclidean distance between the starting point and optimal solution) are $L_1 \le 4$ and $R^2 = \|x_0 - x^*\|^2 \le \frac{n+1}{3}$. Both methods were given the same parameter value (4.0) for $L_1$, but the smoothing stepsizes differ. Whereas RG always uses fixed stepsizes of the form (6.1), STARS uses fixed stepsizes of the form (4.2) in the absolute noise case and uses dynamic stepsizes calculated as (5.3) in the multiplicative noise case. To observe convergence over many random trials, we use a small problem dimension of $n = 8$; however, the behavior shown in Figure 6.1 is typical of the behavior that we observed in higher dimensions (the $n = 8$ case simply requires fewer function evaluations).

[Figure 6.1 panels: (a) $\sigma_a = 10^{-6}$; (b) $\sigma_a = 10^{-3}$; (c) $\sigma_r = 10^{-6}$; (d) $\sigma_r = 10^{-3}$]

Figure 6.1: Median and quartile plots of achieved accuracy with respect to 20 random seeds when applying RG and STARS to the noisy $f_1$ function. Figures 6.1(a) and 6.1(b) show the additive noise case, while Figures 6.1(c) and 6.1(d) show the multiplicative noise case.

In Figure 6.1, we plot the accuracy achieved at each function evaluation, which is the true function value $f(x_k)$ minus the optimal function value $f(x^*)$. The median across 20 trials is plotted as a line; the shaded region denotes the best and worst trials; and the 25% and 75% quartiles are plotted as error bars. We observe that when the function is relatively smooth, as in Figure 6.1(a) when the additive noise is $10^{-6}$, the methods exhibit similar performance. As the function gets more noisy, however, as in Figure 6.1(b) when the additive noise becomes $10^{-4}$, RG shows more fluctuations in performance resulting in large variance, whereas the performance of STARS is almost the same as in the smoother case. The same noise-invariant behavior of STARS can be observed in the multiplicative case.
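The per-evaluation statistics shown in Figure 6.1 (median, quartiles, best and worst trial) can be computed as in the following Python sketch. The array layout (one row per trial, one column per function evaluation) and the function name `summarize_trials` are assumptions made for illustration; the plotting itself is omitted.

```python
import numpy as np


def summarize_trials(accuracy):
    """Summarize accuracy curves across trials.

    accuracy: array of shape (n_trials, n_evaluations) holding f(x_k) - f(x*).
    """
    acc = np.asarray(accuracy, dtype=float)
    return {
        "median": np.median(acc, axis=0),        # plotted as a line
        "q25": np.percentile(acc, 25, axis=0),   # lower end of the error bar
        "q75": np.percentile(acc, 75, axis=0),   # upper end of the error bar
        "best": acc.min(axis=0),                 # lower edge of the shaded region
        "worst": acc.max(axis=0),                # upper edge of the shaded region
    }


if __name__ == "__main__":
    demo = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 trials, 2 evaluations
    for key, value in summarize_trials(demo).items():
        print(key, value)
```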

6.2 Convergence Behavior 

We tested the convergence behavior of STARS with respect to dimension $n$ and noise levels on the same smooth convex function $f_1$ with noise added in the same way as in Section 6.1. The results are summarized in Figure 6.2, where (a) and (b) are for the additive case and (c) and (d) are for the multiplicative case. The horizontal axis marks the problem dimension and the vertical axis shows the absolute accuracy. Two types of absolute accuracy are plotted. First, $\epsilon_{pred}$ (blue crosses) is the best achievable accuracy given a certain noise level, computed by using (4.9) for the additive case and (5.7) for the multiplicative case. Second is the actual accuracy $\epsilon_{actual}$ (red circles) achieved by STARS after $N$ iterations, where $N$, calculated as in (4.8), is the number of iterations needed in theory to get $\epsilon_{pred}$. Because of the stochastic nature of STARS, we perform 15 runs (each with a different random seed) of each test and report the averaged accuracy

$$\epsilon_{actual} = \frac{1}{15}\sum_{i=1}^{15}\epsilon_{actual}^{i} = \frac{1}{15}\sum_{i=1}^{15}\left(f(x_N^{i}) - f^*\right). \qquad (6.3)$$

[Figure 6.2 panels: only the title of panel (d), $\sigma_r = 10^{-6}$, is recoverable]

Figure 6.2: Convergence behavior of STARS: absolute accuracy versus dimension $n$. Two absolute noise levels (a) and (b), and two relative noise levels (c) and (d) are presented.

We observe from Figure 6.2 that the solution obtained by STARS within the iteration limit $N$ is more accurate than that predicted by the theoretical bounds. The difference between predicted and achieved accuracy is always over an order of magnitude and is relatively consistent for all dimensions we examined.

6.3 Illustrative Example 

In this section, we provide a comparison between STARS and four other zero-order algorithms on noisy versions of (6.2) with $n = 8$. The methods we study all share a stochastic nature; that is, a random direction is generated at each iteration. Except for RP [19], which is designed for solving smooth convex functions, the rest are stochastic optimization algorithms. However, we still include RP in the comparison because of its similar algorithmic framework. The algorithms and their function-specific inputs are summarized in Table 6.1, where $\hat{L}_1$ and $\hat{\sigma}^2$ are, respectively, estimates of $L_1$ and $\sigma^2$ given a noisy function (details on how to estimate $L_1$ and $\sigma^2$ are discussed in the appendix). We now briefly introduce each of the tested algorithms; algorithmic and implementation details are given in the appendix.

Table 6.1: Relevant function parameters for different methods.

Abbreviation   Method Name                                       Parameters
STARS          Stepsize Approximation in Random Search           $\hat{L}_1$, $\hat{\sigma}^2$
SS             Random Search for Stochastic Optimization [15]    $L_0$, $R^2$
RSGF           Random Stochastic Gradient Free method [7]        $\hat{L}_1$, $\hat{\sigma}^2$
RP             Random Pursuit [19]                               -
ES             (1+1)-Evolution Strategy [18]                     -


The first zero-order method we include, named SS (Random Search for Stochastic Optimization), is proposed in [15] for solving (1.1). It assumes that $f \in C^{0,0}(\mathbb{R}^n)$ is convex. The SS algorithm, summarized in Algorithm 3, shares the same algorithmic framework as STARS except for the choice of smoothing stepsize $\mu_k$ and the step length $h_k$. It is shown that the quantities $\mu_k$ and $h_k$ can be chosen so that a solution for (1.1) such that $f(x_N) - f^* \le \epsilon$ can be ensured by SS in $O(n^2/\epsilon^2)$ iterations.

Another stochastic zero-order method that also shares an algorithmic framework similar to STARS is RSGF [7], which is summarized in Algorithm 4. RSGF targets the stochastic optimization objective function in (1.1), but the authors relax the convexity assumption and allow $f$ to be nonconvex. However, it is assumed that $\tilde{f}(\cdot, \xi) \in C^{1,1}(\mathbb{R}^n)$ almost surely, which implies that $f \in C^{1,1}(\mathbb{R}^n)$. The authors show that the iteration complexity for RSGF finding an $\epsilon$-accurate solution (i.e., a point $x$ such that $E[\|\nabla f(x)\|] \le \epsilon$) can be bounded by $O(n/\epsilon^2)$. Since such a solution $x$ satisfies $f(x) - f^* \le \epsilon$ when $f$ is convex, this bound improves Nesterov's result in [15] by a factor $n$ for convex stochastic optimization problems.

In contrast with the presented randomized approaches that work with a Gaussian vector $u$, we include an algorithm that samples from a uniform distribution on the unit hypersphere. Summarized in Algorithm 5, RP [19] is designed for unconstrained, smooth, convex optimization. It relaxes the requirement in [15] of approximating directional derivatives via a suitable oracle. Instead, the sampling directions are chosen uniformly at random on the unit hypersphere, and the step lengths are determined by a line search oracle. This randomized method also requires only zeroth-order information about the objective function, but it does not need any function-specific parametrization. It was shown that RP meets the convergence rates of the standard steepest descent method up to a factor $n$.

Experimental studies of variants of the (1+1)-Evolution Strategy (ES), first proposed by Schumer and Steiglitz [18], have shown their effectiveness in practice and their robustness in noisy environments. However, provable convergence rates are derived only for the simplest forms of ES on unimodal objective functions [5, 8, 9], such as sphere or ellipsoidal functions. The implementation we study is summarized in Algorithm 6; however, different variants of this scheme have been studied in [6].

We observe from Figure 6.3 that STARS outperforms the other four algorithms in terms of final accuracy in the solution. In both Figures 6.3(a) and 6.3(b), ES is the fastest algorithm among all in the beginning. However, ES stops progressing after a few iterations, whereas STARS keeps progressing to a more accurate solution. As the noise level increases from $10^{-5}$ to $10^{-1}$, the performance of ES gradually worsens, similar to the other methods SS, RSGF, and RP. However, the noise-invariant property of STARS allows it to remain robust in these noisy environments.

Figure 6.3: Trajectory plots of five zero-order methods in the additive and multiplicative noise settings. The vertical axis represents the true function value $f(x_k)$, and each line is the mean of 20 trials.
7 Appendix


In this appendix we describe the implementation details of the four zero-order methods 
tested in Table 6.1 and Section 6.3. 

Random Search for Stochastic Optimization 


Algorithm 3 (SS: Random Search for Stochastic Optimization)

1: Choose initial point $x_0$ and iteration limit $N$. Fix step length $h_k = h = \frac{R}{(n+4)(N+1)^{1/2}L_0}$ and smoothing stepsize $\mu_k = \mu$. Set $k \leftarrow 1$.

2: Generate a random Gaussian vector $u_k$.

3: Evaluate the function values $\tilde{f}(x_k; \xi_k)$ and $\tilde{f}(x_k + \mu_k u_k; \xi_k)$.

4: Call the random stochastic gradient-free oracle

$$s_\mu(x_k; u_k, \xi_k) = \frac{\tilde{f}(x_k + \mu_k u_k; \xi_k) - \tilde{f}(x_k; \xi_k)}{\mu_k}\,u_k.$$

5: Set $x_{k+1} = x_k - h_k s_\mu(x_k; u_k, \xi_k)$, update $k \leftarrow k + 1$, and return to Step 2.


Algorithm 3 provides the SS (Random Search for Stochastic Optimization) algorithm 
from [15]. 

Remark: $\epsilon$ is suggested to be $2^{-16}$ in the experiments in [15]. Our experiments in Section 6.3, however, show that this choice of $\epsilon$ forces SS to take small steps and thus SS does not converge at all in the noisy environment. Hence, we increase $\epsilon$ (to $\epsilon = 0.1$) to show that, optimistically, SS will work if the stepsize is big enough. Although in the additive noise case one can recover STARS by appropriately setting this $\epsilon$ in SS, it is not possible in the multiplicative case because STARS takes dynamically adjusted smoothing stepsizes in this case.

Randomized Stochastic Gradient-Free Method 

Algorithm 4 provides the RSGF (Randomized Stochastic Gradient-Free Method) algorithm from [7].

Algorithm 4 (RSGF: Randomized Stochastic Gradient-Free Method)

1: Choose initial point $x_0$ and iteration limit $N$. Estimate $L_1$ and $\sigma^2$ of the noisy function $f$. Fix step length as

$$\gamma_k = \gamma = \frac{1}{\sqrt{n+4}}\min\left\{\frac{1}{4L_1\sqrt{n+4}}, \frac{D}{\sigma\sqrt{N}}\right\},$$

where $D = (2f(x_0)/L_1)^{1/2}$. Fix $\mu_k = \mu = 0.0025$. Set $k \leftarrow 1$.

2: Generate a Gaussian vector $u_k$.

3: Evaluate the function values $\tilde{f}(x_k; \xi_k)$ and $\tilde{f}(x_k + \mu_k u_k; \xi_k)$.

4: Call the stochastic zero-order oracle

$$G_\mu(x_k; u_k, \xi_k) = \frac{\tilde{f}(x_k + \mu_k u_k; \xi_k) - \tilde{f}(x_k; \xi_k)}{\mu_k}\,u_k.$$

5: Set $x_{k+1} = x_k - \gamma_k G_\mu(x_k; u_k, \xi_k)$, update $k \leftarrow k + 1$, and return to Step 2.

Remark: Although the convergence analysis of RSGF is based on knowledge of the constants $L_1$ and $\sigma^2$, the discussion in [7] on how to implement RSGF does not rely on these inputs. Because the authors solved a support vector machine problem and an inventory problem, neither of which has known $L_1$ and $\sigma^2$ values, they provide details on how to estimate these parameters given a noisy function. Hence, following [7], the parameter $L_1$ is estimated as the $\ell_2$ norm of the Hessian of the deterministic approximation of the noisy objective functions. This estimation is achieved by using a sample average approximation approach with 200 i.i.d. samples. Also, we compute the stochastic gradients of the objective functions at these randomly selected points and take the maximum variance of the stochastic gradients as an estimate of $\sigma^2$.

Random Pursuit 


Algorithm 5 (RP: Random Pursuit)

1: Choose initial point $x_0$, iteration limit $N$, and line search accuracy $\mu = 0.0025$. Set $k \leftarrow 1$.

2: Choose a random Gaussian vector $u_k$.

3: Choose $x_{k+1} = x_k + \mathrm{LSapprox}_\mu(x_k, u_k) \cdot u_k$, update $k \leftarrow k + 1$, and return to Step 2.


Algorithm 5 provides the RP (Random Pursuit) algorithm from [19]. 

Remark: We follow the authors in [19] and use the built-in MATLAB routine fminunc.m 
as the approximate line search oracle. 
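The structure of Random Pursuit can be sketched in Python as follows. Two substitutions are made for illustration and should not be read as the tested implementation: a golden-section search on a fixed bracket stands in for the fminunc line search oracle, and the Gaussian direction is normalized so that it is uniformly distributed on the unit hypersphere. The names `golden_section` and `random_pursuit` and the `bracket` parameter are hypothetical.

```python
import numpy as np


def golden_section(phi, a, b, tol):
    # Approximate minimizer of a unimodal 1-D function phi on [a, b].
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if phi(c) < phi(d):
            b = d   # minimizer lies in [a, d]
        else:
            a = c   # minimizer lies in [c, b]
    return 0.5 * (a + b)


def random_pursuit(f, x0, n_iters, rng, mu=0.0025, bracket=10.0):
    """Random direction plus approximate line search of accuracy mu."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        u = rng.standard_normal(x.size)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        step = golden_section(lambda t: f(x + t * u), -bracket, bracket, mu)
        x = x + step * u
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sphere = lambda x: 0.5 * float(x @ x)
    x_final = random_pursuit(sphere, np.ones(4), n_iters=200, rng=rng)
    print("final objective:", sphere(x_final))
```

The golden-section routine assumes the objective is unimodal along each search line within the bracket, which holds for the convex test function used here but is not guaranteed for noisy evaluations.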

(1+1)-Evolution Strategy

Algorithm 6 provides the ES ((1+1)-Evolution Strategy) algorithm from [18].

Algorithm 6 (ES: (1+1)-Evolution Strategy)

1: Choose initial point $x_0$, initial stepsize $\sigma_0$, iteration limit $N$, and probability of improvement $p = 0.27$. Set $c_s = e^{1/3} \approx 1.3956$ and $c_f = c_s \cdot e^{-\frac{1}{3(1-p)}} \approx 0.8840$. Set $k \leftarrow 1$.

2: Generate a random Gaussian vector $u_k$.

3: Evaluate the function values $\tilde{f}(x_k; \xi_k)$ and $\tilde{f}(x_k + \sigma_k u_k; \xi_k)$.

4: If $\tilde{f}(x_k + \sigma_k u_k; \xi_k) \le \tilde{f}(x_k; \xi_k)$, then set $x_{k+1} = x_k + \sigma_k u_k$ and $\sigma_{k+1} = c_s\sigma_k$; otherwise, set $x_{k+1} = x_k$ and $\sigma_{k+1} = c_f\sigma_k$.

5: Update $k \leftarrow k + 1$ and return to Step 2.

Remark: A problem-specific parameter required by Algorithm 6 is the initial stepsize $\sigma_0$, which is given in [19] for some of our test functions. The stepsize is multiplied by a factor $c_s = e^{1/3} > 1$ when the mutant's fitness is at least as good as the parent's and is otherwise multiplied by $c_s \cdot e^{-\frac{1}{3(1-p)}} < 1$, where $p$ is the probability of improvement set to the value 0.27 suggested by Schumer and Steiglitz [18].
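The stepsize adaptation of the (1+1)-Evolution Strategy can be sketched in Python as below. The exponent in the failure factor is an assumption: $c_f = c_s \cdot e^{-1/(3(1-p))}$ is inferred here from the quoted values $c_s \approx 1.3956$ and $c_f \approx 0.8840$ at $p = 0.27$. The function names and the callable signature `f_noisy(x, rng)` are hypothetical.

```python
import math

import numpy as np


def es_factors(p=0.27):
    # Success factor c_s and (assumed) failure factor c_f.
    c_s = math.exp(1.0 / 3.0)
    c_f = c_s * math.exp(-(1.0 / 3.0) / (1.0 - p))
    return c_s, c_f


def one_plus_one_es(f_noisy, x0, sigma0, n_iters, rng, p=0.27):
    """(1+1)-ES: keep the mutant if it is at least as good as the parent."""
    c_s, c_f = es_factors(p)
    x = np.asarray(x0, dtype=float).copy()
    sigma = float(sigma0)
    for _ in range(n_iters):
        u = rng.standard_normal(x.size)
        mutant = x + sigma * u
        if f_noisy(mutant, rng) <= f_noisy(x, rng):
            x, sigma = mutant, c_s * sigma   # success: accept and enlarge step
        else:
            sigma = c_f * sigma              # failure: keep parent, shrink step
    return x, sigma


if __name__ == "__main__":
    print("c_s, c_f =", es_factors())
    rng = np.random.default_rng(0)
    sphere = lambda x, rng: 0.5 * float(x @ x)  # noiseless test objective
    x_final, sigma_final = one_plus_one_es(sphere, np.ones(4), 1.0, 300, rng)
    print("final objective:", 0.5 * float(x_final @ x_final))
```

With noisy evaluations the acceptance test compares two noisy values, so a worse mutant can occasionally be accepted; this may be part of why ES stalls at higher noise levels.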